Overview

Column

About the Data Set

The data used in this project is from the Analyze Boston Property Assessment series.

After completing the Data Preparation detailed below, our data set has records for 30,082 single family dwellings in Boston. These include property information as of FY2021 and total assessment values from FY2015 to FY2021.

Data Preparation

Note: Initial steps are handled in CS544_BostonProperties_ETL.R

  • Get 11 features from Property Assessment FY2021. See Code Book on this page.
  • Convert TOTAL_VALUE in FY2021 data from string ($300,000.00) to numeric format (300000).
  • Get PID, PTYPE (equivalent to LUC in FY2021 data), and AV_TOTAL (equivalent to TOTAL_VALUE in FY2021 data) from the following:
  • Limit to data about single family dwellings by getting only records where LUC==101 (FY2021) or PTYPE==101 (all other years).
  • For PID in FY2015-FY2017, remove trailing characters and change from string (0100003000_) to numeric (100003000).
  • Rename TOTAL_VALUE (FY2021) or AV_TOTAL (all other years) to VALUE_{YEAR}.
  • Merge VALUE_{YEAR} from all other years onto FY2021 data on PID.
  • Create derived features AMT_CHANGE and PCT_CHANGE to show how much total assessment values changed from 2015 to 2021.
  • Remove complete duplicates.

Column

Code Book

Feature Description
PID Unique 10-digit parcel number. First 2 digits are the ward, digits 3 to 7 are the parcel, and digits 8 to 10 are the sub-parcel.
CITY City of parcel.
ZIPCODE Zip code of parcel.
LUC State class code. Indicates type of property.
OWN_OCC One-character code indicating if owner receives residential exemption as an owner-occupied property.
LIVING_AREA Living area square footage of the property.
YR_BUILT Year property was built.
EXT_COND Exterior condition.
BED_RMS Total number of bedrooms in the structure.
FULL_BTH Total number of full baths in the structure.
VALUE_{YEAR} Total assessed value for property for the given fiscal year. Originally named AV_TOTAL in data sets for FY2015-2020 and TOTAL_VALUE for FY2021.
AMT_CHANGE Amount change in total assessed value between FY2015 and FY2021. (Derived.)
PCT_CHANGE Percent change in total assessed value between FY2015 and FY2021. (Derived.)
Property_Age Age of the property in FY2021. (Derived.)

Analysis

Column

Analysis

Click on the tabs to view associated charts.

Property Age

  • The oldest property was built in 1710 and newest was built in 2019
  • Most of the single family dwellings were built 50 to 130 years ago.
  • Average age of single family dwellings is 92 years, but with a right-tailed distribution, the center of the data should be the median which is 94 years.

Exterior Condition

  • Majority of the properties have “Average” condition, following with “Good”
  • Properties with “Excellent” and “Poor” conditions together are less than 0.3% of all the single family dwellings.

Owner Occupied

  • The majority of single family dwellings in Boston are owner-occupied as of FY2021.

Bedrooms

  • The median number of bedrooms in single family dwellings is 3.
  • While most have 2 to 5 bedrooms, there are outliers with up to 12 bedrooms recorded.

Correlations

  • Total living area seems to have a positive relationship with total assessment values.
  • But year built, number of bedrooms and bathrooms don’t seem to have relationship with FY2021 total assessment value.
  • Living area, year built, numbers of bedrooms and bathrooms don’t seem to have correlations with each other.

Column

Property Age

Exterior Condition

Owner Occupied

Bedrooms

Correlations

City & Total Value

Row

City & Total Assessment Values

Overall, it is easy to see that single family dwellings in the city of Boston have the highest median total assessment values and the greatest change in dollar amount from FY2015 to FY2021.

The upper fence of values in FY2021 for the city of Boston is $7.7M - greater than any outliers for any other cities. Likewise, the upper fence for change in amount from 2015 to 2021 is $2.5M - again, greater than any outliers for any other cities.

However, it is worth noting that when considering the percent change between the years, the median percent change is actually greater for other cities.

2021 Total Assessment Values

Row

Total Assessment Value Change from 2015 to 2021 ($)

Total Assessment Value Change from 2015 to 2021 (%)

Total Value

Column

Total Assessment Values

After drawing 6000 samples with different sizes from 300 to 600, the averages of sample means are showing normal distributions. Total assessment values of 2021 does have applicability of the Central Limit Theorem.

Distribution of FY2021 Total Assessment Values

Column

Central Limit Theorem

Sampling

Column

Sampling FY2021 Total Assessment Values

Sample (size = 500) Mean of Total Assessment Values
Full data set $688,744
Simple random sampling without replacement $727,306
Simple random sampling with replacement $680,200
Systematic sampling $668,534
Stratified sampling by City $713,485
  • If using SRS without replacement instead of the whole data set, the mean of the assessment values would be much larger.
  • If using SRS with replacement, the mean of assessment values would be smaller but not that much.
  • With systematic sampling, the mean is less. The difference is smaller than SRS without replacement, but greater than SRS with replacement.
  • With stratified sampling by city, the mean is greater, but less than SRS with replacement.

Distribution of 2021 Total Assessment Values

Column

Simple Random Sampling Without Replacement: ATotal Assessment Values

Simple Random Sampling With Replacement: Total Assessment Values

Systematic Sampling: Total Assessment Values

Stratified Sampling by City: Total Assessment Values

Sampling (City)

Column

Sampling FY2021 Total Assessment Values: City Representation

These charts show the representation of the cties in each of our samples.

It is worth noting that neither Brookline or Dedham are represented in any of the samples. Both had very few records in the original data set.

Otherwise, all sampling methods actually do well at representing the cities proportionally.

City Representation

Column

Simple Random Sampling Without Replacement: City Representation

Simple Random Sampling With Replacement: City Representation

Systematic Sampling: City Representation

Stratified Sampling by City: City Representation

---
title: "Analysis of Boston Properties"
author: "Dawn Graham, Sylvie Xiang"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: fill
    source_code: embed
---






Overview
=====================================  
    
Column {data-width=600}
-------------------------------------
    
### About the Data Set {data-height=120}

The data used in this project is from the [Analyze Boston Property Assessment](https://data.boston.gov/dataset/property-assessment){target="_blank"} series.

After completing the Data Preparation detailed below, our data set has records for **30,082 single family dwellings in Boston**. These include property information as of FY2021 and total assessment values from FY2015 to FY2021.

### Data Preparation {data-height=500}

```{r}
library(plotly)
library(plyr)

## Read in data
pa <- read.csv("https://raw.githubusercontent.com/dawngraham/cs544-boston-properties/main/pa.csv",colClasses=c("ZIPCODE"="character"))

# Remove duplicate rows
pa <- pa[!duplicated(pa),]

# Derive change in values from 2015 to 2021
pa$AMT_CHANGE <- pa$VALUE_2021 - pa$VALUE_2015
pa$PCT_CHANGE <- round(pa$AMT_CHANGE / pa$VALUE_2015 * 100)
```   

Note: Initial steps are handled in [CS544_BostonProperties_ETL.R](https://github.com/dawngraham/cs544-boston-properties/blob/main/CS544_BostonProperties_ETL.R){target="_blank"}

- Get 11 features from [Property Assessment FY2021](https://data.boston.gov/dataset/property-assessment/resource/c4b7331e-e213-45a5-adda-052e4dd31d41){target="_blank"}. See Code Book on this page.
- Convert `TOTAL_VALUE` in FY2021 data from string (`$300,000.00`) to numeric format (`300000`).
- Get `PID`, `PTYPE` (equivalent to `LUC` in FY2021 data), and `AV_TOTAL` (equivalent to `TOTAL_VALUE` in FY2021 data) from the following:
  - [Property Assessment FY2020](https://data.boston.gov/dataset/property-assessment/resource/8de4e3a0-c1d2-47cb-8202-98b9cbe3bd04){target="_blank"}
  - [Property Assessment FY2019](https://data.boston.gov/dataset/property-assessment/resource/695a8596-5458-442b-a017-7cd72471aade){target="_blank"}
  - [Property Assessment FY2018](https://data.boston.gov/dataset/property-assessment/resource/fd351943-c2c6-4630-992d-3f895360febd){target="_blank"}
  - [Property Assessment FY2017](https://data.boston.gov/dataset/property-assessment/resource/062fc6fa-b5ff-4270-86cf-202225e40858){target="_blank"}
  - [Property Assessment FY2016](https://data.boston.gov/dataset/property-assessment/resource/cecdf003-9348-4ddb-94e1-673b63940bb8){target="_blank"}
  - [Property Assessment FY2015](https://data.boston.gov/dataset/property-assessment/resource/bdb17c2b-e9ab-44e4-a070-bf804a0e1a7f){target="_blank"}
- Limit to data about **single family dwellings** by getting only records where `LUC==101` (FY2021) or `PTYPE==101` (all other years).
- For `PID` in FY2015-FY2017, remove trailing characters and change from string (`0100003000_`) to numeric (`100003000`).
- Rename `TOTAL_VALUE` (FY2021) or `AV_TOTAL` (all other years) to `VALUE_{YEAR}`.
- Merge `VALUE_{YEAR}` from all other years onto FY2021 data on `PID`.
- Create derived features `AMT_CHANGE` and `PCT_CHANGE` to show how much total assessment values changed from 2015 to 2021.
- Remove complete duplicates.
   
Column {data-width=400}
-------------------------------------
   
### Code Book
| Feature | Description |
|---|-----------|
| `PID` | Unique 10-digit parcel number. First 2 digits are the ward, digits 3 to 7 are the parcel, and digits 8 to 10 are the sub-parcel.|
| `CITY` | City of parcel. |
| `ZIPCODE` | Zip code of parcel. |
| `LUC` | State class code. Indicates type of property. |
| `OWN_OCC` |  One-character code indicating if owner receives residential exemption as an owner-occupied property. |
| `LIVING_AREA` | Living area square footage of the property. |
| `YR_BUILT` | Year property was built. |
| `EXT_COND` | Exterior condition. |
| `BED_RMS` | Total number of bedrooms in the structure. |
| `FULL_BTH` | Total number of full baths in the structure. |
| `VALUE_{YEAR}` | Total assessed value for property for the given fiscal year. Originally named `AV_TOTAL` in data sets for FY2015-2020 and `TOTAL_VALUE` for FY2021. |
| `AMT_CHANGE` | Amount change in total assessed value between FY2015 and FY2021. (Derived.) |
| `PCT_CHANGE` | Percent change in total assessed value between FY2015 and FY2021. (Derived.) |
| `Property_Age` | Age of the property in FY2021. (Derived.) |

Analysis {data-orientation=columns}
=====================================     
   
Column 
-------------------------------------
    
### Analysis
    
Click on the tabs to view associated charts.

#### Property Age

- The oldest property was built in 1710 and newest was built in 2019
- Most of the single family dwellings were built 50 to 130 years ago.
- Average age of single family dwellings is 92 years, but with a right-tailed distribution, the center of the data should be the median which is 94 years.

#### Exterior Condition

- Majority of the properties have "Average" condition, following with "Good"
- Properties with "Excellent" and "Poor" conditions together are less than 0.3% of all the single family dwellings. 

#### Owner Occupied

- The majority of single family dwellings in Boston are owner-occupied as of FY2021.

#### Bedrooms

- The median number of bedrooms in single family dwellings is 3.
- While most have 2 to 5 bedrooms, there are outliers with up to 12 bedrooms recorded.

#### Correlations

- Total living area seems to have a positive relationship with total assessment values.
- But year built, number of bedrooms and bathrooms don't seem to have relationship with FY2021 total assessment value.
- Living area, year built, numbers of bedrooms and bathrooms don't seem to have correlations with each other.


Column {.tabset}
-------------------------------------

### Property Age
    
```{r}
pa$Property_Age <- rep(2021,nrow(pa))-pa$YR_BUILT

hist1 <- hist( pa$Property_Age, breaks = seq(0,320, 10), 
               xlim=c(0,320), ylim = c(0, 5000), col="blue",
               xlab="Property Age (in Years)", main="Single Family Dwellings Age Histogram")

axis(at=seq(0,320,50),side=1)

text(hist1$mids, 29 + hist1$counts, 
     labels = hist1$counts, adj = c(0, 0.5), srt = 90)
```

### Exterior Condition

```{r}
t <- table( pa$EXT_COND)

plt <- barplot(t,ylim=c(0,25000), width = c(10,16,10,10,10),
               xlab="Exterior Conditions", ylab="Freq.",
               main="Single Family Dwellings Exterior Conditions Barplot", 
               col="light blue")

a <- prop.table(t)*100

b <- paste(round(a,2), "%", sep="")
text(plt, 1600+a, labels=b)
```   
 
### Owner Occupied
    
```{r}
pa$OWN_OCC <- mapvalues(pa$OWN_OCC,
                        from=c("Y", "N"),
                        to=c("Yes", "No"))

own_occ <- table(pa$OWN_OCC)

plot_ly(x = names(own_occ),
        y = as.numeric(own_occ),
        type = "bar"
        ) %>%
  layout(title = "Owner Occupied",
         xaxis = list(categoryorder = "category descending"),
         yaxis = list(title = "Single Family Dwellings")
         )
```

### Bedrooms
    
```{r}
bed_rms <- table(pa$BED_RMS)
bed_rms_names <- as.numeric(names(bed_rms))

plot_ly(x = bed_rms_names,
        y = as.numeric(bed_rms),
        type = "bar"
        )%>%
  layout(title = "Bedrooms",
         xaxis = list(title = "Bedrooms",
                      tickvals = seq(1:max(bed_rms_names))),
         yaxis = list(title = "Single Family Dwellings")
  )
```

### Correlations
    
```{r}
# LIVING_AREA, Property_Age, BD, BTH & TOTAL_VALUE 
data <- pa[c(6,20,9,10,17)]
pairs(data)
```


City & Total Value {data-orientation=rows}
=====================================

Row
-------------------------------------
    
### City & Total Assessment Values

Overall, it is easy to see that single family dwellings in the city of Boston have the highest median total assessment values and the greatest change in dollar amount from FY2015 to FY2021.

The upper fence of values in FY2021 for the city of Boston is $7.7M - greater than any outliers for any other cities. Likewise, the upper fence for change in amount from 2015 to 2021 is $2.5M - again, greater than any outliers for any other cities.

However, it is worth noting that when considering the percent change between the years, the median percent change is actually greater for other cities.

### 2021 Total Assessment Values
    
```{r}
# CITY & TOTAL_VALUE 
fig <- plot_ly(pa, y = ~VALUE_2021, color = ~CITY, type = "box")
fig <- fig %>% layout(yaxis = list(title = ""))
fig
``` 

Row
-------------------------------------
    
### Total Assessment Value Change from 2015 to 2021 ($)
    
```{r}
fig <- plot_ly(pa, y = ~AMT_CHANGE, color = ~CITY, type = "box")
fig <- fig %>% layout(yaxis = list(title = "Change in Dollars"))
fig
```
    
### Total Assessment Value Change from 2015 to 2021 (%)

```{r}
fig <- plot_ly(pa, y = ~PCT_CHANGE, color = ~CITY, type = "box")
fig <- fig %>% layout(yaxis = list(title = "Change in Percentage"))
fig
```

Total Value {data-orientation=columns}
=====================================

Column
-------------------------------------
    
### Total Assessment Values {data-height=90}

After drawing 6000 samples with different sizes from 300 to 600, the averages of sample means are showing normal distributions. Total assessment values of 2021 does have applicability of the Central Limit Theorem.

### Distribution of FY2021 Total Assessment Values
    
```{r}
# TOTAL_VALUE 
fig <- plot_ly(pa, x = ~VALUE_2021, type = "histogram")
fig <- fig %>% layout(xaxis = list(title = ""))
fig
``` 

Column
-------------------------------------

### Central Limit Theorem

```{r}
# TOTAL_VALUE 
options(scipen=999)
par(mfrow=c(2,2))

c <- c()

set.seed(2)

for (size in c(300, 400, 500, 600)) {
  for (i in 1:6000) {
    c[i] <- mean( sample(pa$VALUE_2021, size = size, replace = F) )
  }
  
  hist(c, prob = T, xlab="inputs",
       xlim=c(500000,950000), ylim=c(0,0.000014),
       main = paste("Sample Size =", size))
}
```

Sampling {data-orientation=columns}
=====================================     
   
Column 
-------------------------------------
    
### Sampling FY2021 Total Assessment Values {data-height=250}

| Sample (size = 500) | Mean of Total Assessment Values |
| ---------- | -- |
| Full data set | $688,744 |
| Simple random sampling without replacement | $727,306 |
| Simple random sampling with replacement | $680,200 |
| Systematic sampling | $668,534 |
| Stratified sampling by City | $713,485 |

- If using SRS without replacement instead of the whole data set, the mean of the assessment values would be much larger.
- If using SRS with replacement, the mean of assessment values would be smaller but not that much.
- With systematic sampling, the mean is less. The difference is smaller than SRS without replacement, but greater than SRS with replacement.
- With stratified sampling by city, the mean is greater, but less than SRS with replacement.


### Distribution of 2021 Total Assessment Values {data-height=150}

```{r}

fig <- plot_ly(pa, x = ~VALUE_2021, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```   


Column
-------------------------------------

### Simple Random Sampling Without Replacement: ATotal Assessment Values

```{r}

library(sampling)
library(UsingR)

set.seed(2)

N <- nrow(pa)
n <- 500

srs <- srswor(n,N)

sample <- pa[srs!= 0,]

fig <- plot_ly(sample, x = ~VALUE_2021, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```

### Simple Random Sampling With Replacement: Total Assessment Values

```{r}

set.seed(2)

srs2 <- srswr(n,N)
sample2 <- pa[srs2 != 0, ]

fig <- plot_ly(sample2, x = ~VALUE_2021, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```   

### Systematic Sampling: Total Assessment Values
    
```{r}

N <- nrow(pa)
n <- 500
k <- floor(N / n)
r <- sample(k, 1)

# select every kth item
s <- seq(r, by = k, length = n)
sample.systematic <- pa[s, ]

fig <- plot_ly(sample.systematic, x = ~VALUE_2021, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```

### Stratified Sampling by City: Total Assessment Values
    
```{r, results='hide'}

set.seed(42)
pa <- pa[order(pa$CITY), ]
freq <- table(pa$CITY)

# Drop groups that are too small
pa2 <- pa[!(pa$CITY=="BROOKLINE" | pa$CITY=="DEDHAM"),]

freq <- table(pa2$CITY)
st.sizes <- 500 * freq / sum(freq)
st <- sampling::strata(pa2, stratanames = c("CITY"),
                       size = st.sizes, method = "srswor",
                       description = TRUE)

sample.stratified <- getdata(pa2, st)
```

```{r}
fig <- plot_ly(sample.stratified, x = ~VALUE_2021, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```


Sampling (City) {data-orientation=columns}
=====================================     
   
Column 
-------------------------------------
    
### Sampling FY2021 Total Assessment Values: City Representation {data-height=150}

These charts show the representation of the cties in each of our samples.

It is worth noting that neither Brookline or Dedham are represented in any of the samples. Both had very few records in the original data set.

Otherwise, all sampling methods actually do well at representing the cities proportionally.

### City Representation {data-height=200}

```{r}
# Simple random sampling without replacement
# City Representation
fig <- plot_ly(pa, x = ~CITY, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```


Column
-------------------------------------

### Simple Random Sampling Without Replacement: City Representation

```{r}
fig <- plot_ly(sample, x = ~CITY, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```

### Simple Random Sampling With Replacement: City Representation

```{r}
fig <- plot_ly(sample2, x = ~CITY, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig

```   

### Systematic Sampling: City Representation
    
```{r}
fig <- plot_ly(sample.systematic, x = ~CITY, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```

### Stratified Sampling by City: City Representation
    
```{r}
fig <- plot_ly(sample.stratified, x = ~CITY, type = "histogram", histnorm='probability')
fig <- fig %>% layout(xaxis = list(title = ""))
fig
```